Estimation Of Distribution Algorithm

''Estimation of distribution algorithms'' (EDAs), sometimes called ''probabilistic model-building genetic algorithms'' (PMBGAs), are stochastic optimization methods that guide the search for the optimum by building and sampling explicit probabilistic models of promising candidate solutions. Optimization is viewed as a series of incremental updates of a probabilistic model, starting with the model encoding an uninformative prior over admissible solutions and ending with the model that generates only the global optima.

EDAs belong to the class of evolutionary algorithms. The main difference between EDAs and most conventional evolutionary algorithms is that evolutionary algorithms generate new candidate solutions using an ''implicit'' distribution defined by one or more variation operators, whereas EDAs use an ''explicit'' probability distribution encoded by a Bayesian network, a multivariate normal distribution, or another model class. Like other evolutionary algorithms, EDAs can be used to solve optimization problems defined over a number of representations, from vectors to LISP-style S-expressions, and the quality of candidate solutions is often evaluated using one or more objective functions.

The general procedure of an EDA is outlined in the following:

 ''t'' := 0
 initialize model M(0) to represent uniform distribution over admissible solutions
 while (termination criteria not met) do
     ''P'' := generate N>0 candidate solutions by sampling M(''t'')
     ''F'' := evaluate all candidate solutions in ''P''
     M(''t'' + 1) := adjust_model(''P'', ''F'', M(''t''))
     ''t'' := ''t'' + 1

Using explicit probabilistic models in optimization allowed EDAs to feasibly solve optimization problems that were notoriously difficult for most conventional evolutionary algorithms and traditional optimization techniques, such as problems with high levels of epistasis. A further advantage of EDAs is that they provide an optimization practitioner with a series of probabilistic models that reveal a lot of information about the problem being solved. This information can in turn be used to design problem-specific neighborhood operators for local search, to bias future runs of EDAs on a similar problem, or to create an efficient computational model of the problem.

For example, if the population is represented by bit strings of length 4, the EDA can represent the population of promising solutions using a single vector of four probabilities (p1, p2, p3, p4), where each component of p defines the probability of that position being a 1. Using this probability vector it is possible to create an arbitrary number of candidate solutions.
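The probability-vector example can be illustrated with a minimal Python sketch. The function name `sample_population` and the numeric values chosen for (p1, p2, p3, p4) are assumptions made for illustration only; they do not come from any particular EDA implementation.

```python
import random

def sample_population(p, n, rng=None):
    """Draw n bit strings; position i is set to 1 with probability p[i]."""
    rng = rng or random.Random()
    return [[1 if rng.random() < p_i else 0 for p_i in p] for _ in range(n)]

# Hypothetical probability vector (p1, p2, p3, p4) for bit strings of length 4.
p = [0.9, 0.5, 0.1, 1.0]
population = sample_population(p, n=6, rng=random.Random(0))
for x in population:
    print(x)
```

Because the model is just a vector of probabilities, any number of candidate solutions can be drawn from it by changing `n`.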


Estimation of distribution algorithms (EDAs)

This section describes the models built by some well-known EDAs of different levels of complexity. Throughout, we assume a population $P(t)$ at generation $t$, a selection operator $S$, a model-building operator $\alpha$ and a sampling operator $\beta$.


Univariate factorizations

The simplest EDAs assume that decision variables are independent, i.e. $p(X_1,X_2) = p(X_1)\cdot p(X_2)$. Therefore, univariate EDAs rely only on univariate statistics, and multivariate distributions must be factorized as the product of $N$ univariate probability distributions,

$$D_\text{Univariate} := p(X_1,\dots,X_N) = \prod_{i=1}^{N} p(X_i).$$

Such factorizations are used in many different EDAs; some of them are described next.


Univariate marginal distribution algorithm (UMDA)

The UMDA is a simple EDA that uses an operator $\alpha_\text{UMDA}$ to estimate marginal probabilities from a selected population $S(P(t))$. Assuming that $S(P(t))$ contains $\lambda$ elements, $\alpha_\text{UMDA}$ produces the probabilities

$$p_{t+1}(X_i) = \dfrac{1}{\lambda} \sum_{x\in S(P(t))} x_i,~\forall i\in \{1,2,\dots,N\}.$$

Every UMDA step can be described as follows:

$$D(t+1) = \alpha_\text{UMDA} \circ S \circ \beta_{\lambda}(D(t)).$$
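A minimal sketch of one possible UMDA loop on bit strings is given below. It assumes truncation selection for $S$, the OneMax function (number of ones) as a stand-in objective, and separate sample and selection sizes (`pop_size`, `n_selected`); these choices and all parameter values are illustrative assumptions, not part of the description above.

```python
import random

def onemax(x):
    """Toy objective (an assumption for this sketch): number of ones."""
    return sum(x)

def umda(f, n_vars, pop_size=100, n_selected=50, generations=50, seed=0):
    rng = random.Random(seed)
    p = [0.5] * n_vars  # uniform initial model D(0)
    best = None
    for _ in range(generations):
        # beta: sample a population from the current univariate model
        pop = [[1 if rng.random() < p_i else 0 for p_i in p]
               for _ in range(pop_size)]
        # S: truncation selection keeps the n_selected (lambda) fittest
        pop.sort(key=f, reverse=True)
        selected = pop[:n_selected]
        if best is None or f(selected[0]) > f(best):
            best = selected[0]
        # alpha_UMDA: p_{t+1}(X_i) = (1/lambda) * sum of x_i over selected
        p = [sum(x[i] for x in selected) / n_selected for i in range(n_vars)]
    return best, p

best, model = umda(onemax, n_vars=20)
print(onemax(best), [round(p_i, 2) for p_i in model])
```

Note that the marginals can reach exactly 0 or 1, after which that position can no longer change; practical implementations often bound the probabilities away from the extremes, which this sketch does not do.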


Population-based incremental learning (PBIL)

The PBIL represents the population implicitly by its model, from which it samples new solutions and updates the model. At each generation, $\mu$ individuals are sampled and $\lambda\leq \mu$ are selected. These individuals are then used to update the model as follows:

$$p_{t+1}(X_i) = (1- \gamma)\, p_{t}(X_i) + (\gamma/\lambda) \sum_{x\in S(P(t))} x_i,~\forall i\in \{1,2,\dots,N\},$$

where $\gamma\in(0,1]$ is a parameter defining the learning rate; a small value means that the previous model $p_t(X_i)$ is only slightly modified by the newly sampled solutions. PBIL can be described as

$$D(t+1) = \alpha_\text{PBIL} \circ S \circ \beta_\mu(D(t)).$$
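The PBIL update rule can be written as a short Python function. This is only a sketch of the model update for bit strings; the function name `pbil_update` and the example values are hypothetical, and sampling and selection are assumed to happen elsewhere.

```python
def pbil_update(p, selected, gamma):
    """Blend the old marginals with the selected individuals' frequencies:
    p_{t+1}(X_i) = (1 - gamma) * p_t(X_i) + (gamma / lambda) * sum_x x_i."""
    lam = len(selected)
    return [(1 - gamma) * p_i + (gamma / lam) * sum(x[i] for x in selected)
            for i, p_i in enumerate(p)]

p = [0.5, 0.5, 0.5, 0.5]
selected = [[1, 1, 0, 1], [1, 0, 0, 1]]  # hypothetical selected individuals
print(pbil_update(p, selected, gamma=0.1))
```

With `gamma = 1` the old model is discarded entirely and the update reduces to estimating the marginal frequencies of the selected individuals.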


Compact genetic algorithm (cGA)

The CGA also relies on the implicit populations defined by univariate distributions. At each generation $t$, two individuals $x,y$ are sampled, $P(t)=\beta_2(D(t))$. The population $P(t)$ is then sorted in decreasing order of fitness, $S_{\text{Sort}(f)}(P(t))$, with $u$ being the best and $v$ being the worst solution. The CGA estimates univariate probabilities as follows:

$$p_{t+1}(X_i) = p_t(X_i) + \gamma (u_i - v_i), \quad\forall i\in \{1,2,\dots,N\},$$

where $\gamma\in(0,1]$ is a constant defining the learning rate, usually set to $\gamma=1/N$. The CGA can be defined as

$$D(t+1) = \alpha_\text{CGA} \circ S_{\text{Sort}(f)} \circ \beta_2(D(t)).$$
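The following sketch shows how the compact GA update might look in Python for bit strings. Clamping the probabilities to [0, 1], the tolerance-based stopping test, and the iteration cap are assumptions added to keep the example well behaved; they are not stated in the description above.

```python
import random

def cga(f, n_vars, gamma=None, max_iters=20000, seed=0):
    rng = random.Random(seed)
    gamma = 1.0 / n_vars if gamma is None else gamma
    p = [0.5] * n_vars
    eps = 1e-9  # tolerance for "probability has reached 0 or 1"
    for _ in range(max_iters):
        # beta_2: sample two individuals from the current model
        x = [1 if rng.random() < p_i else 0 for p_i in p]
        y = [1 if rng.random() < p_i else 0 for p_i in p]
        # sort by fitness: u is the better, v the worse of the two
        u, v = (x, y) if f(x) >= f(y) else (y, x)
        # p_{t+1}(X_i) = p_t(X_i) + gamma * (u_i - v_i), clamped to [0, 1]
        p = [min(1.0, max(0.0, p_i + gamma * (u_i - v_i)))
             for p_i, u_i, v_i in zip(p, u, v)]
        if all(p_i < eps or p_i > 1.0 - eps for p_i in p):
            break  # every marginal has converged
    return p

# Maximize the number of ones in a 10-bit string with a small learning rate.
p = cga(sum, n_vars=10, gamma=0.01)
print([round(p_i, 2) for p_i in p])
```

Only positions where the two sampled individuals differ are updated, so the model drifts toward the bits of the winner.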


Bivariate factorizations

Although univariate models can be computed efficiently, in many cases they are not representative enough to provide better performance than GAs. To overcome this drawback, the use of bivariate factorizations was proposed in the EDA community, in which dependencies between pairs of variables can be modeled. A bivariate factorization can be defined as follows, where $\pi_i$ contains a possible variable on which $X_i$ depends, i.e. $|\pi_i|=1$:

$$D_\text{Bivariate} := p(X_1,\dots,X_N) = \prod_{i=1}^{N} p(X_i \mid \pi_i).$$

Bivariate and multivariate distributions are usually represented as probabilistic graphical models (PGMs), i.e. graphs in which edges denote statistical dependencies (or conditional probabilities) and vertices denote variables. To learn the structure of a PGM from data, linkage-learning is employed.


Mutual information maximizing input clustering (MIMIC)

The MIMIC factorizes the joint probability distribution in a chain-like model representing successive dependencies between variables. It finds a permutation of the decision variables, $r : i \mapsto j$, such that the chain $x_{r(1)}, x_{r(2)},\dots,x_{r(N)}$ minimizes the Kullback-Leibler divergence in relation to the true probability distribution, with each variable conditioned on its successor in the chain, i.e. $\pi_{r(i)} = \{X_{r(i+1)}\}$. MIMIC models a distribution

$$p_{t+1}(X_1,\dots,X_N) = p_t(X_{r(N)}) \prod_{i=1}^{N-1} p_t(X_{r(i)} \mid X_{r(i+1)}).$$

New solutions are sampled along the chain: the first variable, $X_{r(N)}$, is generated independently and the others according to the conditional probabilities. Since the estimated distribution must be recomputed each generation, MIMIC uses concrete populations in the following way:

$$P(t+1) = \beta_\mu \circ \alpha_\text{MIMIC} \circ S(P(t)).$$
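One way to build such a chain is a greedy ordering based on empirical entropies: start from the variable with the lowest entropy, then repeatedly append the variable with the lowest conditional entropy given the previously chosen one. The sketch below assumes this greedy heuristic, which is not spelled out in the text above, and covers only the ordering step; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts, total):
    """Shannon entropy (in bits) of an empirical distribution."""
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def cond_entropy(data, i, j):
    """Empirical H(X_i | X_j) = H(X_i, X_j) - H(X_j)."""
    n = len(data)
    joint = Counter((x[i], x[j]) for x in data)
    marginal = Counter(x[j] for x in data)
    return entropy(joint, n) - entropy(marginal, n)

def mimic_chain(data):
    """Return variable indices in sampling order [r(N), r(N-1), ..., r(1)]."""
    n = len(data)
    remaining = set(range(len(data[0])))
    # The unconditioned variable: lowest marginal entropy (ties by index).
    first = min(remaining,
                key=lambda i: (entropy(Counter(x[i] for x in data), n), i))
    order = [first]
    remaining.remove(first)
    while remaining:
        prev = order[-1]
        # Next variable: lowest conditional entropy given the previous one.
        nxt = min(remaining, key=lambda i: (cond_entropy(data, i, prev), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical selected population of four 3-bit strings.
selected = [[0, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 0]]
print(mimic_chain(selected))
```

Sampling would then draw the first variable of the returned order from its marginal and each following variable from its conditional distribution given the preceding one.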


Bivariate marginal distribution algorithm (BMDA)

The BMDA factorizes the joint probability distribution in bivariate distributions. First, a randomly chosen variable is added as a node in a graph. Then, among the variables not yet in the graph, the one most dependent on a variable already in the graph is added. This procedure is repeated until no remaining variable depends on any variable in the graph (verified according to a threshold value). The resulting model is a forest with multiple trees rooted at nodes $\Upsilon_t$. Denoting by $I_t$ the non-root variables, BMDA estimates a factorized distribution in which the root variables can be sampled independently, whereas all the others must be conditioned on the parent variable $\pi_i$:

$$p_{t+1}(X_1,\dots,X_N) = \prod_{X_i\in \Upsilon_t} p_t(X_i) \cdot \prod_{X_i\in I_t} p_t(X_i \mid \pi_i).$$

Each step of BMDA is defined as follows:

$$P(t+1) = \beta_\mu \circ \alpha_\text{BMDA} \circ S(P(t)).$$


Multivariate factorizations

The next stage of EDA development was the use of multivariate factorizations. In this case, the joint probability distribution is usually factorized into a number of components of limited size, $|\pi_i| \leq K,~\forall i\in \{1,2,\dots,N\}$:

$$p(X_1,\dots,X_N) = \prod_{i=1}^{N} p(X_i \mid \pi_i).$$

Learning PGMs that encode multivariate distributions is a computationally expensive task; therefore, it is usual for EDAs to estimate multivariate statistics from bivariate statistics. Such a relaxation allows the PGM to be built in polynomial time in $N$; however, it also limits the generality of such EDAs.


Extended compact genetic algorithm (eCGA)

The ECGA was one of the first EDAs to employ multivariate factorizations, in which high-order dependencies among decision variables can be modeled. Its approach factorizes the joint probability distribution into the product of multivariate marginal distributions. Assume $T_\text{eCGA}$ is a set of subsets, in which every $\tau\in T_\text{eCGA}$ is a linkage set containing $|\tau|\leq K$ variables. The factorized joint probability distribution is represented as follows:

$$p(X_1,\dots,X_N) = \prod_{\tau\in T_\text{eCGA}} p(\tau).$$

The ECGA popularized the term "linkage-learning" as denoting procedures that identify linkage sets. Its linkage-learning procedure relies on two measures: (1) the Model Complexity (MC) and (2) the Compressed Population Complexity (CPC). The MC quantifies the model representation size in terms of the number of bits required to store all the marginal probabilities:

$$MC = \log_2 (\lambda+1) \sum_{\tau\in T_\text{eCGA}} (2^{|\tau|}-1).$$

The CPC, on the other hand, quantifies the data compression in terms of the entropy of the marginal distribution over all partitions:

$$CPC = \lambda \sum_{\tau\in T_\text{eCGA}} H(\tau),$$

where $\lambda$ is the selected population size, $|\tau|$ is the number of decision variables in the linkage set $\tau$, and $H(\tau)$ is the joint entropy of the variables in $\tau$.

The linkage-learning in ECGA works as follows: (1) insert each variable in its own cluster, (2) compute CCC = MC + CPC of the current linkage sets, (3) evaluate the change in CCC obtained by joining each pair of clusters, (4) join the pair of clusters with the highest CCC improvement (i.e. the largest reduction). This procedure is repeated until no CCC improvements are possible and produces a linkage model $T_\text{eCGA}$. The ECGA works with concrete populations; therefore, using the factorized distribution modeled by ECGA, it can be described as

$$P(t+1) = \beta_\mu \circ \alpha_\text{eCGA} \circ S(P(t)).$$
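The two complexity measures and the greedy merging loop can be sketched in Python as follows. The sketch assumes binary variables, uses $2^{|\tau|}-1$ stored frequencies per linkage set in the MC term, and treats a lower CCC as better; the function names and the toy population are hypothetical.

```python
import math
from collections import Counter

def joint_entropy(population, tau):
    """Empirical joint entropy H(tau) in bits of the variables indexed by tau."""
    n = len(population)
    counts = Counter(tuple(x[i] for i in tau) for x in population)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def combined_complexity(population, linkage_sets):
    """CCC = MC + CPC for a given partition of the variables."""
    lam = len(population)
    mc = math.log2(lam + 1) * sum(2 ** len(tau) - 1 for tau in linkage_sets)
    cpc = lam * sum(joint_entropy(population, tau) for tau in linkage_sets)
    return mc + cpc

def learn_linkage(population):
    """Greedily merge the pair of clusters that lowers CCC the most."""
    sets = [(i,) for i in range(len(population[0]))]
    best = combined_complexity(population, sets)
    while len(sets) > 1:
        candidates = []
        for a in range(len(sets)):
            for b in range(a + 1, len(sets)):
                merged = [s for k, s in enumerate(sets) if k not in (a, b)]
                merged.append(sets[a] + sets[b])
                candidates.append((combined_complexity(population, merged), merged))
        score, merged = min(candidates, key=lambda c: c[0])
        if score >= best:
            break  # no merge improves the criterion
        best, sets = score, merged
    return sets

# Toy selected population: bits 0 and 1 always agree, bit 2 is independent.
population = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]] * 16
print(learn_linkage(population))
```

In this toy population, merging variables 0 and 1 halves their contribution to the CPC at a small cost in MC, so the procedure groups them and leaves variable 2 alone.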


Bayesian optimization algorithm (BOA)

The BOA uses Bayesian networks to model and sample promising solutions. Bayesian networks are directed acyclic graphs, with nodes representing variables and edges representing conditional dependencies between pairs of variables. The value of a variable $x_i$ can be conditioned on a maximum of $K$ other variables, defined in $\pi_i$. BOA builds a PGM encoding a factorized joint distribution, in which the parameters of the network, i.e. the conditional probabilities, are estimated from the selected population using the maximum likelihood estimator:

$$p(X_1,X_2,\dots,X_N)=\prod_{i=1}^{N} p(X_i \mid \pi_{i}).$$

The Bayesian network structure, on the other hand, must be built iteratively (linkage-learning). It starts with a network without edges and, at each step, adds the edge that most improves some scoring metric (e.g. the Bayesian information criterion (BIC) or the Bayesian-Dirichlet metric with likelihood equivalence (BDe)). The scoring metric evaluates the network structure according to its accuracy in modeling the selected population. From the built network, BOA samples new promising solutions as follows: (1) it computes an ancestral ordering of the variables, in which each node is preceded by its parents; (2) each variable is sampled conditionally on its parents. Given this scenario, every BOA step can be defined as

$$P(t+1) = \beta_\mu \circ \alpha_\text{BOA} \circ S(P(t)).$$
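The two sampling steps can be sketched in Python for binary variables. The network structure and conditional probability tables below are hypothetical inputs assumed to have been learned already; structure learning and parameter estimation are not shown.

```python
import random

def ancestral_order(parents):
    """Order the nodes so that every node appears after all of its parents."""
    order, placed = [], set()
    while len(order) < len(parents):
        progressed = False
        for node, pa in parents.items():
            if node not in placed and all(q in placed for q in pa):
                order.append(node)
                placed.add(node)
                progressed = True
        if not progressed:
            raise ValueError("graph contains a cycle")
    return order

def sample_network(parents, cpt, rng):
    """Sample one solution; cpt[node][parent_values] = P(node = 1 | parents)."""
    x = {}
    for node in ancestral_order(parents):
        key = tuple(x[q] for q in parents[node])
        x[node] = 1 if rng.random() < cpt[node][key] else 0
    return x

# Hypothetical network: X0 has no parent, X1 depends on X0, X2 on X1.
parents = {0: (), 1: (0,), 2: (1,)}
cpt = {
    0: {(): 0.5},
    1: {(0,): 0.1, (1,): 0.9},
    2: {(0,): 0.8, (1,): 0.2},
}
rng = random.Random(0)
print([sample_network(parents, cpt, rng) for _ in range(3)])
```

Processing nodes in ancestral order guarantees that the parent values needed to look up each conditional probability are already available.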


Linkage-tree Genetic Algorithm (LTGA)

The LTGA differs from most EDAs in the sense that it does not explicitly model a probability distribution but only a linkage model, called a linkage tree. A linkage model $T$ is a set of linkage sets with no probability distribution associated; therefore, there is no way to sample new solutions directly from $T$. The linkage model is a linkage tree stored as a family of sets (FOS), denoted $T_\text{LT}$.

The linkage-tree learning procedure is a hierarchical clustering algorithm, which works as follows. At each step the two ''closest'' clusters $i$ and $j$ are merged; this procedure repeats until only one cluster remains, and each subtree is stored as a subset $\tau\in T_\text{LT}$.

The LTGA uses $T_\text{LT}$ to guide an "optimal mixing" procedure, which resembles a recombination operator but only accepts improving moves. We denote it as $R_\text{LTGA}$, where the notation $x[\tau]\gets y[\tau]$ indicates the transfer of the genetic material indexed by $\tau$ from $y$ to $x$. In the procedure below, the auxiliary variable $b$ holds the previous value of $x_i[\tau]$ so that a non-improving move can be undone.

 Input: a family of subsets $T_\text{LT}$ and a population $P(t)$
 Output: a population $P(t+1)$
 for each $x_i$ in $P(t)$ do
     for each $\tau$ in $T_\text{LT}$ do
         choose a random $x_j\in P(t)$ : $x_i\neq x_j$
         $f_{x_i}$ := $f(x_i)$
         $b$ := $x_i[\tau]$
         $x_i[\tau]$ := $x_j[\tau]$
         if $f(x_i) \leq f_{x_i}$ then
             $x_i[\tau]$ := $b$
 return $P(t)$

The LTGA does not implement typical selection operators; instead, selection is performed during recombination. Similar ideas are commonly applied in local-search heuristics and, in this sense, the LTGA can be seen as a hybrid method. In summary, one step of the LTGA is defined as

$$P(t+1) = R_\text{LTGA}(P(t)) \circ \alpha_\text{LTGA}(P(t)).$$
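The optimal mixing step can be sketched in Python as follows. The sketch assumes maximization, draws donors from the unmodified input population rather than from the population being updated, and restores the overwritten genes whenever the fitness does not strictly improve; the function name `optimal_mixing` and the example family of subsets are hypothetical.

```python
import random

def optimal_mixing(population, fos, f, rng):
    """Return a new population in which each individual has tried, for every
    linkage set tau, to copy the genes indexed by tau from a random donor."""
    new_pop = [list(x) for x in population]
    for i, x in enumerate(new_pop):
        fx = f(x)
        for tau in fos:
            # Donor: any individual other than the one being improved.
            donor = rng.choice([y for j, y in enumerate(population) if j != i])
            backup = [x[k] for k in tau]
            for k in tau:
                x[k] = donor[k]
            f_new = f(x)
            if f_new > fx:
                fx = f_new  # accept the improving move
            else:
                for k, b in zip(tau, backup):
                    x[k] = b  # undo the non-improving move
    return new_pop

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(6)] for _ in range(8)]
fos = [(0,), (1,), (2, 3), (4, 5), (0, 1, 2, 3)]  # hypothetical linkage sets
mixed = optimal_mixing(population, fos, sum, rng)
print([sum(x) for x in population], [sum(x) for x in mixed])
```

Because only improving moves are kept, no individual's fitness can decrease during mixing, which is how selection is folded into recombination.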


Other

* Probability collectives (PC)
* Hill climbing with learning (HCwL)
* Estimation of multivariate normal algorithm (EMNA)
* Estimation of Bayesian networks algorithm (EBNA)
* Stochastic hill climbing with learning by vectors of normal distributions (SHCLVND)
* Real-coded PBIL
* Selfish Gene Algorithm (SG)
* Compact Differential Evolution (cDE) and its variants
* Compact Particle Swarm Optimization (cPSO)
* Compact Bacterial Foraging Optimization (cBFO)
* Probabilistic incremental program evolution (PIPE)
* Estimation of Gaussian networks algorithm (EGNA)
* Estimation multivariate normal algorithm with thresheld convergence
* Dependency Structure Matrix Genetic Algorithm (DSMGA)


Related

* CMA-ES
* Cross-entropy method

